Plant seedlings: determining plant species using image classification

Introduction

The ability to identify plant species quickly and accurately can improve crop yields and encourage responsible use of the environment, in particular by helping to differentiate weeds from crop seedlings.

Objective

  1. Build a convolutional neural network for classifying plant seedlings.
  2. Analyze the model's performance and identify strategies to improve it.

Data Information/Variables

images.npy - numpy array containing images of the plant seedlings
labels.csv - labels corresponding to the plants in the images

The analysis below has the following sections:

  1. Loading and importing packages
  2. Removing warnings from python notebooks
  3. Loading the dataset
  4. Printing images from each class with their corresponding labels
  5. Exploratory data analysis - mean images, distribution of classes
  6. Data preprocessing/model preparation - removing noise (Gaussian blurring), normalizing the data, splitting the data into train and test sets, plotting images before and after preprocessing, label encoding.
  7. Convolutional neural network - building the convolutional neural network model
  8. Convolutional neural network performance improvement - analysis of model performance, improvement of the model
  9. Summary and key takeaways - final conclusions and summary of the analysis

1. Loading and importing packages

2. Removing warnings from python notebook

3. Loading the dataset

Observations

The labels dataframe has only one column, which contains the name of the plant seedling; there is a one-to-one correspondence between the index of the labels dataframe and the index of the image dataset. There are 4750 rows, so there will be 4750 images in the "images" dataset.
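A minimal loading sketch for this step. It falls back to small synthetic stand-ins when images.npy/labels.csv are not present, so the snippet can be run anywhere; the file names come from the data description above.

```python
from pathlib import Path

import numpy as np
import pandas as pd

if Path("images.npy").exists() and Path("labels.csv").exists():
    images = np.load("images.npy")      # expected shape: (4750, 128, 128, 3)
    labels = pd.read_csv("labels.csv")  # one "Label" column, 4750 rows
else:
    # Synthetic stand-ins with the same structure, for offline sketching.
    images = np.zeros((10, 128, 128, 3), dtype=np.uint8)
    labels = pd.DataFrame({"Label": ["Maize"] * 10})

# Sanity check: one label row per image.
assert len(labels) == images.shape[0]
```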

Observations

There are 4750 images in all. The images are in RGB (3 channels) and are 128 x 128 pixels.

Observations

In the above preview, the first 10 images in the dataset are displayed. The images are numpy arrays and can be viewed using matplotlib's imshow function. We can observe that the green parts in the images are the plants/seedlings, while the background has pebbles as well as paper labels. Overall, the images do not have much contrast between the seedlings and the pebbles, as both are dark colors. The number and shape of the leaves differ from image to image.

Observations

The histogram of pixel intensities has its main peak near the middle of the range, with little useful information in the right tail. There are also several smaller secondary peaks.

Observations

The labels dataframe has a column "Label" and has string labels which correspond to the name of the plant seedlings in the images.

Observations

There are no missing values in the labels dataset.

Observations

There are 12 unique labels in the dataset. The top label is "Loose Silky-bent" and occurs 654 times in the dataset.

Observations

As we saw earlier, the 'Loose Silky-bent' occurs most frequently in the dataset. Maize and Common wheat have the lowest counts. Some of the labels have multiple words, and some words are separated with a hyphen. We will be encoding the labels column while preparing the data for the models.
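The class counts can be inspected with pandas' value_counts(); a toy stand-in for the labels dataframe (the real one has 4750 rows across 12 classes, with "Loose Silky-bent" on top at 654 occurrences):

```python
import pandas as pd

# Toy stand-in for the labels dataframe.
labels = pd.DataFrame(
    {"Label": ["Loose Silky-bent"] * 3 + ["Common Chickweed"] * 2 + ["Maize"]}
)

# value_counts() sorts classes from most to least frequent.
counts = labels["Label"].value_counts()
top_label, top_count = counts.index[0], counts.iloc[0]
```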

4. Printing images from each class with their corresponding labels

Observations

As observed earlier, all the images are bluish-green and some images have a black and white strip in them (likely barcode). All the plant seedlings are green, and the pebbles in the background are mostly bluish. Some of the plant seedlings have broad leaves, while some of them have very narrow and thin leaves.

5. Exploratory data analysis - mean images, distribution of classes

Observations

As observed earlier, the largest number of images are for "Loose Silky-bent", a plant seedling (about 13% of the data). This is followed by "Common Chickweed". While "Loose Silky-bent" is a plant seedling, "Common Chickweed" is a weed. Since the main aim of the model is to identify whether a seedling is a crop or a weed (to help improve crop yields), these two classes have a fairly good distribution (i.e. a high number of examples) within the dataset.

We can also observe that there is no label class that does not have any images. "Maize" and "Common wheat" have the lowest number of images (~4%).

Overall, the class distribution looks fairly even, with sufficient examples for each class.

Observations and insights

  1. In the above images, the mean pixel value for each class, averaged across all of that class's images, has been plotted. Even though only averages are shown, they remain close to the actual pixel distributions: the outline of the plant seedlings is still visible in the mean images.
  2. The images were converted to grayscale (a reduction in dimensions). This will help with execution of the models later on, as a 128 x 128 x 3 image is reduced to 128 x 128 while preserving the edge information that will be important in training, and it reduces the time needed to run the models.
  3. We do not see any significant difference between the mean images of crops and weeds (e.g. the maize mean image versus the common chickweed mean image). This suggests that weeds and edible crops have similar intensity distributions in grayscale - for example, if maize occupies a certain range of the grayscale, the weed species are likely to fall in that same range. This makes classification tougher, as the color distribution cannot be relied upon as a distinguishing feature.
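As an illustration, the grayscale conversion and per-class mean images can be sketched with numpy. The data here is a synthetic stand-in, and the luma weights are the standard ITU-R 601 coefficients (an assumption about how the conversion was done):

```python
import numpy as np

def to_grayscale(rgb_batch):
    """Collapse (N, H, W, 3) RGB images to (N, H, W) with luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb_batch @ weights  # contracts the channel axis

def mean_image_per_class(gray_batch, labels):
    """Average all grayscale images sharing a label into one mean image."""
    return {lab: gray_batch[labels == lab].mean(axis=0) for lab in np.unique(labels)}

# Synthetic stand-in: 6 images, 2 classes.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(6, 128, 128, 3)).astype(float)
labels = np.array(["Maize", "Common Chickweed"] * 3)

gray = to_grayscale(images)
means = mean_image_per_class(gray, labels)
```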

6. Data preprocessing/model preparation

This section will have the following subsections

  1. removing noise (gaussian blurring)
  2. splitting the data into train and test
  3. label encoding
  4. normalizing the data

Observations

Gaussian blurring uses a low-pass filter to smooth the image, so fine details of the object under consideration are smoothed over (hidden). As we can observe in the above images, the maize details in the first image are reduced in the third image (e.g. the central vein of the leaf is less distinct). This can be advantageous for deep learning algorithms, as it removes noise the model might otherwise fit to.
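A sketch of the blurring step on synthetic data. The notebook presumably uses cv2.GaussianBlur; scipy's gaussian_filter is used here as an equivalent low-pass filter to keep the example dependency-light:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_batch(images, sigma=1.0):
    """Gaussian low-pass filter applied per image and per channel.

    sigma is 0 on the batch and channel axes so that separate images
    and color channels are never mixed together.
    """
    return gaussian_filter(images.astype(float), sigma=(0, sigma, sigma, 0))

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(2, 128, 128, 3)).astype(float)
blurred = blur_batch(images, sigma=2.0)
```

Since blurring suppresses high-frequency detail, the pixel variance of the blurred batch drops relative to the original.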

Observations

The RGB images were 128 x 128 x 3, and there were 4750 of them. After conversion to grayscale, the dataset consists of 4750 images of 128 x 128; the grayscale images are displayed below. However, when we compare the grayscale images to the RGB images, visual inspection shows that the blurred RGB images retain better contrast, with the plant seedlings more identifiable against the background. Hence, we shall use the RGB images for the model.

Observation

The grayscale images match the originals, except that they have reduced dimensions and a single channel of pixel values in the range 0 - 255.

Observations

We have used a test_size of 0.1, so the training set has 4275 images and the test set has 475 images. The labels dataset has one column, while the images are 128 x 128 x 3 (RGB).
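The split can be sketched as follows. The array contents are synthetic, but the counts match the dataset; the stratify argument, which keeps the class proportions similar in both sets, is an assumption about the notebook's exact call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4750
# Stand-in arrays (real images are 128 x 128 x 3; shrunk here to stay light).
images = rng.integers(0, 256, size=(n, 8, 8, 3), dtype=np.uint8)
labels = rng.integers(0, 12, size=n)  # 12 seedling classes

# test_size=0.1 reproduces the 4275 / 475 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.1, random_state=42, stratify=labels
)
```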

Observations

The labels dataset has been encoded to 12 classes.
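The encoding step maps each string label to an integer id and then to a one-hot vector. The notebook may use keras.utils.to_categorical for the second step; an equivalent sketch with scikit-learn and numpy on a few toy labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Maize", "Common wheat", "Loose Silky-bent", "Maize"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)            # string label -> integer id
one_hot = np.eye(len(encoder.classes_))[encoded]   # integer id -> one-hot row

# In the real dataset, encoder.classes_ would hold all 12 seedling names.
```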

Observations

There are 4275 images in the training set and 475 images in the test set. After normalization, the pixel values in the training set range from 0 to 1. The training and testing images have been reshaped so that they are compatible with Keras model building. This concludes the preprocessing of the dataset for the deep learning models.
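The normalization step can be sketched as follows (synthetic stand-in data):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(5, 128, 128, 3), dtype=np.uint8)

# Scale pixel values from [0, 255] into [0, 1]; cast to float32 for Keras.
X_train_norm = X_train.astype("float32") / 255.0
```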

7. Convolutional neural network - building the convolutional neural network model

CNN model 1

The first model that can be built for analysis is a simple one. This model will have 2 convolution layers, with the ReLU activation function used for the layers. This is a multi-class classification problem, so the output layer (using a softmax function to produce class probabilities) will have 12 nodes, one per class.
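As a sketch, a model of this shape could look like the following in Keras. The filter counts and the hidden dense width of 64 are illustrative assumptions, not the exact values used:

```python
from tensorflow.keras import layers, models

# Two convolution layers with ReLU, then a hidden dense layer and a
# 12-way softmax output, matching the description above.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(12, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```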

Note about model predictions

This is a multi-class classification problem, as there are 12 classes in the target variable.

However, some of the classes are weeds and some are edible crops. As the ultimate aim of the analysis is to increase crop yields by removing weeds, we can use the example of maize and common chickweed to determine which metric would be most useful.

If the model predicts that image A (example) is maize, and the image is maize (this is supervised learning so we know the outcome), this is a true positive.

If the model predicts that image A (example) is common chickweed (a weed), and the image is maize (this is supervised learning so we know the outcome), this is a false negative (FN).

If the model predicts that image A (example) is maize, and the image is not maize, this is a false positive (FP).

To evaluate which performance metric is most valuable for the analysis, we have to consider

(a) Predicting that a plant seedling is maize, but it is not - likelihood that weeds have been missed and will compete with crops for resources (false positive)

(b) Predicting that a plant seedling is not maize, but it is maize - loss of actual crop(false negative)

In general, we do not want to lose actual crops, but we also want to get rid of weeds - so for this particular case, accuracy (proportion of correctly classified photos) will be a useful metric.
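Accuracy can be computed directly from the predictions; a toy example with hypothetical labels:

```python
import numpy as np

# Hypothetical true labels and model predictions for four images.
y_true = np.array(["Maize", "Common Chickweed", "Maize", "Common wheat"])
y_pred = np.array(["Maize", "Common Chickweed", "Common Chickweed", "Common wheat"])

# Accuracy = fraction of correctly classified images: 3 of 4 correct here.
accuracy = (y_true == y_pred).mean()
```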

Observations

CNN Model 1 is a simple model with 2 convolution layers. The number of neurons in the hidden dense layer was kept between the output layer size and the input image width (i.e. between 12 and 128). As we can see, the accuracy after 5 epochs was 65%.

Observation

We can see that there are 0 non-trainable parameters. The CNN had 2 dense layers, one of which was the output layer with 12 neurons.

Observations

When CNN Model 1 was evaluated on the test set, the accuracy was 51%, compared to 65% on the training set. The model can be further improved, as shown in CNN Model 2 and CNN Model 3.

Observations

The blue curve shows the loss for training and the red curve the loss for test. The two curves are close to each other, so overfitting is unlikely.

Observations

The model can be improved, but it still performs decently, as seen above.

Observation

CNN Model 2, which added early stopping, dropout, and max pooling, had a training accuracy of 62.8% and an accuracy of 60% on the test set.
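A condensed Keras sketch of this kind of architecture. It has fewer convolution blocks than the actual model, and all filter counts and dropout placements are assumptions; it is meant only to show how max pooling, dropout, and early stopping fit together:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

model2 = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),   # pool features, keep brighter activations
    layers.Dropout(0.25),          # randomly drop neurons during training
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(12, activation="softmax"),
])
model2.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])

# Early stopping halts training once validation loss stops improving.
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
# model2.fit(X_train, y_train, validation_split=0.1,
#            epochs=30, callbacks=[early_stop])
```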

Observation

The curve shows that the train and test curves are close together, so overfitting is unlikely. However, since the curves are still improving as epochs increase, the model can likely be improved further.

Conclusion and key takeaways

The images dataset had 12 categories of plant seedlings and weeds. The mean images did not show much difference between the categories.

The grayscale images, while reducing the dimension of the image matrix (128 x 128 x 3 down to 128 x 128), made it harder to discern the seedlings from their background. Usually, grayscale images present an advantage, since they reduce the dimensions of the images to be processed while retaining feature information. However, in this particular dataset, the original RGB images retained better visual contrast between the seedlings and the background. Hence, RGB images were used for the deep learning models.

Gaussian blurring was used to reduce noise in the image. Since the background had pebbles, blurring proved to be advantageous so that the edge differentiation was more pronounced between the seedling and the background, instead of edge presence between different pebbles in the background.

The first CNN model was a simple one, consisting of convolution layers and dense layers, without any regularization to guard against overfitting. The accuracy on training and testing was 65% and 51% respectively.

The second CNN model used 5 convolution layers, 2 dense layers (including the output), and max pooling. Max pooling is useful since it pools neighborhoods of features and tends to select the brighter pixels in an image. In our dataset, the seedling pixels are brighter than the background, so max pooling is well suited.

Dropout was also utilized in the second CNN model. This technique randomly ignores a fraction of neurons during the training phase. It is usually used to prevent overfitting, and was applied in CNN Model 2 with rates of 0.25 and 0.3.

Overall, the CNN model 2 had an accuracy of 62.8% for training set and 60% for testing set. It can be interpreted that the model is not overfitting. This model performed better than CNN Model 1 since the accuracy on the testing set is higher and closer to the training set.

However, several improvements to the model are possible. Since both models are somewhat underfitting (not capturing all the information in the images), one could use data augmentation, where the training images undergo horizontal flips or rotations so that the total number of training images increases. This could be coupled with transfer learning, where the weights are initialized from a pre-built architecture and only the last few layers are trained.
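A minimal numpy stand-in for the augmentation idea (horizontal flips only; a real pipeline would likely use Keras's image augmentation utilities, which also cover rotations and shifts):

```python
import numpy as np

def augment_flips(images, labels):
    """Double the training set by adding horizontally flipped copies.

    Flipping along the width axis creates a new valid training view of each
    seedling without needing additional photographs.
    """
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 128, 128, 3), dtype=np.uint8)
labels = np.array([0, 1, 2, 3])

aug_images, aug_labels = augment_flips(images, labels)
```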

Overall summary: the model can be further improved to achieve greater accuracy on both the training and testing sets. Training CNN models is time-consuming, so efficient architectural design of the network will pay dividends.